Important Features PCA for high dimensional clustering
Abstract
We consider a clustering problem where we observe feature vectors Xi ∈ R^p, i = 1, 2, ..., n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges. We propose Important Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov-Smirnov (KS) scores, where the threshold is chosen by adapting the recent notion of Higher Criticism, obtain the first (K − 1) left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical k-means to these singular vectors. IF-PCA is a tuning-free clustering method. We apply IF-PCA to 10 gene microarray data sets, where the method has competitive clustering performance. In particular, in three of the data sets, the error rates of IF-PCA are at most 29% of the error rates of other methods. We have also rediscovered a phenomenon on the empirical null, noted in [16], on microarray data. With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov-Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it differs from them in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.
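The three steps named in the abstract (KS-based feature selection, the first K − 1 left singular vectors, then k-means) can be illustrated with a minimal sketch. It is not the authors' implementation and it simplifies in several ways that are assumptions here: the Higher Criticism threshold is replaced by a user-supplied count `n_keep`, the KS scores are not re-normalized, and k-means is a plain Lloyd iteration with random restarts. The names `ks_scores`, `lloyd_kmeans`, `if_pca_sketch` and `n_keep` are hypothetical.

```python
import math
import numpy as np

# Standard normal CDF, vectorized over arrays (NumPy itself has no erf).
_norm_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))


def ks_scores(W):
    """sqrt(n) times the KS distance between each column of W and N(0, 1)."""
    n = W.shape[0]
    F = _norm_cdf(np.sort(W, axis=0))          # N(0, 1) CDF at the order statistics
    upper = np.arange(1, n + 1)[:, None] / n   # empirical CDF just after each point
    lower = np.arange(0, n)[:, None] / n       # empirical CDF just before each point
    d = np.maximum(upper - F, F - lower).max(axis=0)
    return math.sqrt(n) * d


def lloyd_kmeans(U, k, n_iter=100, n_init=10, seed=0):
    """Plain Lloyd k-means with random restarts; returns the lowest-cost labels."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = U[rng.choice(len(U), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        cost = dist[np.arange(len(U)), labels].sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels


def if_pca_sketch(X, K, n_keep):
    """Simplified IF-PCA: n_keep (an assumption) stands in for the HC threshold."""
    # Normalize each feature to mean 0 and variance 1 (assumes no constant column).
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 1: keep the n_keep features with the largest KS scores.
    keep = np.argsort(ks_scores(W))[-n_keep:]
    # Step 2: first K - 1 left singular vectors of the post-selection matrix.
    U, _, _ = np.linalg.svd(W[:, keep], full_matrices=False)
    # Step 3: k-means on those singular vectors.
    return lloyd_kmeans(U[:, :K - 1], K), keep
```

Calling `if_pca_sketch(X, K=2, n_keep=10)` on an n × p array returns the estimated labels and the indices of the retained features. In the procedure described above the number of retained features is chosen by the data-driven Higher Criticism threshold, so no such argument would be needed.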
Similar resources
High-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
Phase Transitions for High Dimensional Clustering and Related Problems
Consider a two-class clustering problem where we observe Xi = ℓi μ + Zi, Zi iid ∼ N(0, Ip), 1 ≤ i ≤ n. The feature vector μ ∈ R^p is unknown but is presumably sparse. The class labels ℓi ∈ {−1, 1} are also unknown and the main interest is to estimate them. We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we fin...
Influential Features PCA for High Dimensional Clustering
We consider a clustering problem where we observe feature vectors Xi ∈ R^p, i = 1, 2, ..., n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges. We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select...
Improved Cluster Partition in Principal Component Analysis Guided Clustering
The principal component analysis (PCA) guided clustering approach is widely used on high-dimensional data to improve the efficiency of K-means cluster solutions. Typically, Pearson correlation is used in PCA to provide an eigenanalysis that yields the components accounting for most of the variation in the data. However, PCA based on Pearson correlation can be sensitive to non-Gaussian distr...
Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach
Feature selection can be decisive when analyzing high-dimensional data, especially with a small number of samples. Feature extraction methods do not perform well in these conditions. With small sample sets and high-dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...